No closing connections mode #1845
Conversation
Ugh... I had a long reply written and somehow, going back and forth between the "Conversation" and "Files Changed" tabs, I lost it. Sorry that this reply might be a bit terse.

pgx used to work this way years ago. It led to a lot of edge cases that were very difficult to be sure were reliably handled, and they were almost impossible to test. In addition, any context cancellation can cause an unrecoverable situation (e.g. an interrupted write over TLS is fatal). The current behavior is intentional.

I agree it would be better if a context cancellation didn't close the underlying connection, but it is harder than it looks. I don't see any obvious bugs with the PR, but this is not a "no obvious bugs" feature -- it is an "obviously no bugs" feature. I'm not actually sure what that would look like though -- which is why I switched to closing connections instead of improving the recover-connections-on-context-cancellation code that existed.

My concerns are more general than specific, but I do have a little concrete feedback regarding the PR:
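For context, here is a minimal sketch of the current behavior being discussed -- a context cancellation mid-query closes the underlying connection. It assumes a reachable `DATABASE_URL` and is illustrative only:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
)

func main() {
	conn, err := pgconn.Connect(context.Background(), os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// pg_sleep outlives the deadline, so the cancellation interrupts the read.
	_, err = conn.Exec(ctx, "select pg_sleep(5)").ReadAll()
	fmt.Println("query error:", err)

	// With the current v5 behavior, the cancellation also closed the connection.
	fmt.Println("connection closed:", conn.IsClosed())
}
```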
But as I said above, I don't really have any better ideas either.
@jackc thank you for your feedback! I left only one public function, LaunchCleanup, that does all the cleaning work. I removed the cleanup that ran when a transaction is committed/rolled back with an already-canceled context, as it's a corner case that happens very rarely in practice, and it's hard to handle without bloating the public interface. I have also added tests for the new functionality.

About the option concern: I think our only way here is to leave this as an option, and if it becomes widely used we could make it the default behaviour. As you said, there will be two paths, but we need them both: recovery for PgBouncer, and closing for working with TLS, for example. Also, I think for now we don't want to break everything again by switching back.

We need cleanup only for PgBouncer, as it is single-threaded and poor-performing but, unfortunately, already an industry standard. That's why I hope we can somehow solve this issue, as a lot of high-load systems need this fix.
I'm still not convinced this is the right approach. And while, unfortunately, I still do not have a complete picture of what the right approach is, my thoughts have coalesced a little bit.

I don't think that this should be a mode. If it is correct and reliable it would always be preferable to the current close-on-error behavior. The only reason for a mode is to limit the damage if there were a bug.

This change would be backwards compatible. It is and will always be possible for a context cancellation to break the connection (e.g. an interrupted write over TLS), and it is and will always be possible for a context cancellation to not break the connection (e.g. the context is cancelled before a query even starts). So calling code always needs to handle an interrupted query resulting in either a broken connection or a connection that is still usable. This change would simply increase the fraction of non-broken connections.

I'd like to approach this primarily from the point of view of pgconn, not of pgxpool. There may need to be a hook for pgxpool to determine the state of a released connection, but I don't think it should need anything more than that. And ideally, someone using pgconn directly would never need to explicitly call any recovery logic.

I think we would want to implement this cancellation recovery logic at only very specific places where we can be sure of the state of the connection and how to clean it up. For example, only while waiting for / reading results from a query.

I think this is as simple as we could get, but yet it still has complicated edge cases. In many cases it would be best to call CancelRequest as part of the recovery.
Or it may even be desirable to make the cancellation recovery process entirely customizable via an application-provided function. Not exactly sure what that would look like, but I am pretty sure that has been requested at some point in the past.
@jackc thank you for your feedback and explanations! I have reworked this PR and put all the recovery logic inside pgconn. I put the recovery logic into ResultReader and MultiResultReader at the points where we try to read new messages (in fact we could put recovery everywhere we try to read data, because as far as I know every PostgreSQL command ends with a ReadyForQuery message; but to start with, it is in ResultReader and MultiResultReader only).

I wasn't able to make the whole recovery process customizable, but I made the part where we send the cancel request customizable. Sending a cancel request is the default, but there is an option to supply a custom OnRecover function or to disable it completely. For quick queries it is most likely that the query has executed and the response has been sent before the cancel request goes out, so the request would just waste resources; hence the option to disable sending cancel requests. I think I remember the issue where the cancellation recovery process was requested to be customizable precisely because of the cancel requests -- someone wanted to disable them.

P.S. Unfortunately I didn't understand the last part of your feedback
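Roughly, the option shape I have in mind looks like this -- the names here are hypothetical, taken from the description above rather than a released pgx API:

```go
package recoversketch

import (
	"context"

	"github.com/jackc/pgx/v5/pgconn"
)

// RecoverConfig is a hypothetical shape for the option described above;
// the field and function names are illustrative, not part of pgx.
type RecoverConfig struct {
	// Enabled turns on recovery instead of closing the connection when a
	// context is canceled mid-query.
	Enabled bool

	// OnRecover runs while a canceled query is being recovered. The default
	// sends a PostgreSQL cancel request; a no-op value skips it for
	// workloads where fast queries usually finish before the request could
	// reach the server.
	OnRecover func(ctx context.Context, conn *pgconn.PgConn)
}
```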
Sorry, I just hadn't merged master. Fixed the build failure and checked https://github.com/zcolleen/pgx/actions/runs/7769417704 -- now everything should be fine with the tests.
I looked over the code and it is an improvement to have the logic in pgconn. But as I mentioned in a previous reply, I'm still unconvinced this is the correct overall approach.

This was bringing back vague memories, so I looked back at the previous versions. We had something like this in v3. There was async cancel query logic, and I don't see anything specifically wrong with it, but I also am having difficulty fully understanding how it works. Given my previous experience with finding it difficult to reason about a solution to this problem, I am really nervous about adding the feature.

Maybe if we step back a bit there is another approach that will solve your underlying issue with PgBouncer and context cancellation. I think one of the fundamental issues is that there are multiple valid ways an application may want a context cancellation to be handled. One is to unblock the Go code right now. Another is to ask the PostgreSQL server to cancel the query but not unblock the Go code. Another approach is to wait a short time to see if the query will complete naturally, then send a cancel request, then wait a while longer, then finally interrupt the connection if the server-side cancellation hasn't finished.

Right now, pgx takes the first approach exclusively -- unblock the Go code at all costs. It uses a deadline on the underlying connection to interrupt any in-progress read or write.

I think that recovering from arbitrary network interruption is too complex. However, I think we should be able to get the desired functionality without recovering from network interruption. Perhaps we should consider exposing a function or object that allows the calling application to control what happens when the context is canceled. It could wait, send a cancellation request, or close the connection when and as it saw fit. That should allow directly handling all cases except where the caller wants to immediately unblock the Go code and wants to preserve the connection. But that can be handled by the caller using a goroutine or wrapper object. Another advantage of this approach is that alternative means of query cancellation could be used, e.g. another connection could be used to call pg_cancel_backend().
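For illustration, the second strategy -- ask the server to cancel but keep the Go code and the connection intact -- can already be sketched on top of pgconn's existing `CancelRequest`. A sketch only, with error handling elided:

```go
package cancelsketch

import (
	"context"

	"github.com/jackc/pgx/v5/pgconn"
)

// queryWithServerCancel runs the query with a background context so pgx
// never interrupts the socket, and watches the caller's context itself:
// on cancellation it asks the server to cancel instead of breaking the
// connection.
func queryWithServerCancel(ctx context.Context, conn *pgconn.PgConn, sql string) ([]*pgconn.Result, error) {
	done := make(chan struct{})
	defer close(done)

	go func() {
		select {
		case <-ctx.Done():
			// Ask PostgreSQL to cancel the running query; Exec below then
			// returns a server-side "canceled" error, but the connection
			// stays usable.
			_ = conn.CancelRequest(context.Background())
		case <-done:
		}
	}()

	return conn.Exec(context.Background(), sql).ReadAll()
}
```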
Here's a proof of concept of the approach I described: #1894. It doesn't preclude later adding the ability to recover from a deadline interrupting a network read, but I think it makes it unnecessary.
Okay, now I get your idea. But for me it has some disadvantages:
But if you are sure we are not going down the recovery road and not giving it a second chance, then let's move on with your idea.
In fact I would have to rewrite the entire library to support context cancelation with a disabled Handler.
I don't think this is accurate.
The default HandleCancel implementation has no changes in behavior from the existing code. It sets a deadline when the context is canceled and clears it at the end of the query method -- exactly as the current logic does.
It's true that this only happens when the context is canceled during the query, not when it is canceled before the query, but again, that is the same behavior that exists now. There is no need for canceling a query or recovering from an interruption because the query is never sent.
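For reference, the handler interface under discussion looks roughly like this -- reconstructed from the method names mentioned in this thread, so consult the pgconn source for the exact merged form:

```go
package ctxwatchsketch

import "context"

// Handler is a rough reconstruction of the context-watcher interface
// discussed in this thread; method names come from the comments above.
type Handler interface {
	// HandleCancel runs when the context is canceled while a query is in
	// flight. The default implementation sets a deadline on the underlying
	// connection, unblocking the Go code -- the pre-existing behavior.
	HandleCancel(ctx context.Context)

	// HandleUnwatchAfterCancel runs at the end of the query method if
	// HandleCancel fired; the default implementation clears the deadline.
	HandleUnwatchAfterCancel()
}
```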
I am leaning against adding recovery logic at this time. But the proposed Handler approach still seems worth pursuing.
I mean, if I were to redefine the Handler -- for example, set a long deadline to wait after cancel for the query to finish naturally -- in this case I won't get any context-canceled error. That's why I thought of expanding the Handler interface just a little, so that HandleUnwatchAfterCancel returns an error, and teaching Unwatch to return an error. With this we could at least get a delayed context-cancelation error if we want to. What do you think?
So the problem is that if a handler delays interrupting the query such that the query completes successfully, the query still returns a context error even though no error should be returned? So the problem isn't that the Handler changes the returned error -- the problem is that it doesn't or can't change the returned error. Do I understand correctly?
The opposite. I made this short example to clarify everything: https://github.com/zcolleen/pgx_test/blob/no-error-example/main.go. In this example no error is returned, while, as I see the correct behaviour of the library, there should be a context error.
Ah, I see now. I actually don't think there should be an error. Returning the context error says the query failed because of a context cancellation, but the query did complete successfully. If the caller wants to know whether the context is canceled, they can check the context itself after the fact.
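For example, a sketch of that pattern (illustrative names only):

```go
package ctxchecksketch

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// queryThenCheckCtx shows the pattern described above: if the query method
// succeeded, inspect ctx.Err() afterwards instead of expecting the query
// method to surface the cancellation.
func queryThenCheckCtx(ctx context.Context, conn *pgx.Conn) (int, error) {
	var n int
	err := conn.QueryRow(ctx, "select 1").Scan(&n)
	if err == nil && ctx.Err() != nil {
		// The query completed successfully, but the context was canceled
		// while it ran; the application decides whether that matters.
		return n, ctx.Err()
	}
	return n, err
}
```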
That's true on the one hand, but on the other it makes using the library very inconvenient, because the user would have to check the context after each query. That's why I thought of adding an error to HandleUnwatchAfterCancel(), so that if you want an error after the context is canceled, you return it (you may return any other error, not only a context-canceled error). But if you don't want an error, just return nil in HandleUnwatchAfterCancel (that is the default). That's how I see this feature. But anyway, the best outcome for me is the PR with recovery because, while it lets the query finish, it keeps two core features:
I guess I still don't understand this use case. Can you give me an example or pseudocode where it would be useful to know that the context was canceled while the query method was executing, in spite of the fact that the query method succeeded?
Sure: https://github.com/zcolleen/pgx_test/blob/context-check-example/main.go. That is also why the PR with recovery is better for this case: if we return immediately after the context is canceled, we are not holding any other resources, such as a gRPC or HTTP connection, longer than the deadline in the context.
Thanks, I see what's going on. But IMO the correct solution is for the calling code to handle the cancellation on its own.
That is the correct one, but we do not always know that we have to handle cancellation on our own. Imagine this: as the developer of a wrapper above pgx that hands a pre-configured pgx to users to work with PgBouncer, I wouldn't want my wrapper code to rely on code that my users would write after the pgx query call. So if there were an option to return an error from pgx, I would definitely use it, to return the error to my wrapper's users.
Hi @jackc! Any updates here? Have you made up your mind about the approach to this issue? Do you think we could merge this PR into some experimental branch and give it an experimental tag to see if it works well in production?
@zcolleen Unfortunately, I still think it is too difficult to reason about. I've spent enough hours tracking down subtle concurrency issues in pgx to err on the side of caution for future changes. Sorry, but I don't think this PR can be merged.
@jackc well, anyway, thanks for the discussion and the time spent on this issue.
Hi @jackc!
This is a long story about a performance issue with PgBouncer.
In high-load systems there are many context cancelations. When a context is canceled, pgx closes the connection asynchronously. And here the PgBouncer performance issue comes in: when there are a lot of connection closings, PgBouncer has to serve all of them and can't keep up with the rest of the load, and the whole system starts failing. This issue has been discussed many times but is still not resolved, so, unfortunately, it is impossible to use pgxpool under high load with PgBouncer and a deadline in the context.
In our company we solved this issue with a fork of pgx v4, https://github.com/ozontech/pgx/tree/ozontech-patch, but v4 is quite old, so I decided to port the patch to v5 and bring it to you.
This issue is very hard to reproduce locally as it is tied to high load, but at least I made this example: https://github.com/zcolleen/pgx_test. I started Postgres with PgBouncer and ran the program. I chose a timeout at which about 20 percent of the traffic hits context cancelations (I made a kind of "metrics": the /failed/deadline handler shows the percentage of cancelations). As time passes, the percentage grows because PgBouncer can't cope with this type of load, so at the end it is about 90. Then I switched pgx to my fork, and with it the percentage of cancelations does not grow, staying where it was at the start.
The patch is quite simple. When a context is canceled, we return the error to the user but we do not close the connection. Instead, we asynchronously clean the socket of data left over from the previous request and just return the connection to the pool. As practice shows, this is much easier for PgBouncer than closing connections, so even if you have a lot of context cancelations, the whole system won't fail.
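The cleanup itself amounts to draining the protocol until ReadyForQuery. A minimal sketch of the idea (illustrative names, not the PR's actual code):

```go
package drainsketch

import (
	"context"

	"github.com/jackc/pgx/v5/pgconn"
	"github.com/jackc/pgx/v5/pgproto3"
)

// drainUntilReady discards whatever the server was still sending for the
// canceled query. Only after ReadyForQuery is the protocol back in sync
// and the connection safe to return to the pool.
func drainUntilReady(ctx context.Context, conn *pgconn.PgConn) error {
	for {
		msg, err := conn.ReceiveMessage(ctx)
		if err != nil {
			// A real network error: the connection must still be closed.
			return err
		}
		if _, ok := msg.(*pgproto3.ReadyForQuery); ok {
			return nil
		}
	}
}
```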
I made a new option in the config that enables this mechanic; by default the behaviour is unchanged.
I hope you will have some time to take a look at this PR and give some feedback. Thanks in advance!
P.S. This mechanic is not something new; database/sql does much the same thing.